Abstract

In this paper we introduce a dynamic programming algorithm that performs linear text
segmentation by global minimization of a segmentation cost function consisting of two
terms: (a) within-segment word similarity and (b) prior information about segment length.
Evaluated on a collection of Greek texts, the algorithm achieves high segmentation
accuracy and appears to be innovative and promising.

Keywords: Text Segmentation, Document Retrieval, Information Retrieval, Machine 
Learning. 

1 Introduction 

Text segmentation is an important problem in information retrieval. Its goal is the
division of a text into homogeneous ('lexically coherent') segments, i.e. segments
exhibiting the following properties: (a) each segment deals with a particular subject and
(b) contiguous segments deal with different subjects. Those segments can be retrieved
from a large database of unformatted (or loosely formatted) text as being relevant to a
query.

This paper presents a dynamic programming algorithm which performs linear seg- 
mentation 1 by global minimization of a segmentation cost. The segmentation cost is 
defined by a function consisting of two factors: (a) within-segment word similarity and 
(b) prior information about segment length. Our algorithm has the advantage that it 
can be applied to either large texts - to segment them into their constituent parts (e.g. 
to segment an article into sections) - or to a stream of independent, concatenated texts 
(e.g. to segment a transcript of news into separate stories). 

To measure the homogeneity (or, alternatively, the heterogeneity) of the segments of
a text, several segmentation algorithms using a variety of criteria have been proposed
in the literature. Some of them use linguistic criteria such as cue phrases, punctuation
marks, prosodic features, reference, syntax and lexical attraction. Others,
following Halliday and Hasan's theory, utilize statistical similarity measures such as
word co-occurrence. For example, the linear discourse segmentation algorithm proposed
by Morris and Hirst is based on lexical cohesion relations determined by use of
Roget's thesaurus. In the same direction, Kozima's algorithm computes
the semantic similarity between words using a semantic network constructed from a
subset of the Longman Dictionary of Contemporary English. Local minima of the
similarity scores correspond to the positions of topic boundaries in the text.

Youmans and later Hearst focused on the similarity between adjacent
parts of a text. They used a sliding window of text and plotted the number of first-used
words in the window as a function of the window position within the text. In this plot,
segment boundaries correspond to deep valleys followed by sharp upturns. Kan
expanded the same idea by combining word usage with visual layout information.

1 As opposed to hierarchical segmentation [28].

On the other hand, other researchers focused on the similarity between all parts of a
text. A graphical representation of this similarity is a dotplot. Reynar and Choi
used dotplots in conjunction with divisive clustering (which can be seen as a form
of approximate and local optimization) to perform linear text segmentation. Related
work was proposed by Yaari [28], who used divisive / agglomerative clustering
to perform hierarchical segmentation. Another approach to clustering performs exact
and global optimization by dynamic programming; this was used by Ponte and Croft,
Heinonen, and Utiyama and Isahara [26].

Finally, other researchers use probabilistic approaches to text segmentation, including
the use of Hidden Markov Models. Beeferman also calculated the
probability distribution on segment boundaries by utilizing word usage statistics, cue
words and several other features.

2 The algorithm 

2.1 Representation 

Suppose that a text contains T sentences and its vocabulary contains L distinct words
(i.e. words that are not included in the stop list; otherwise most sentences would be
similar to most others). This text can be represented by a T x L matrix F defined as
follows: for t = 1, 2, ..., T and l = 1, 2, ..., L we set

\[
F_{t,l} =
\begin{cases}
1 & \text{if the $l$-th word is in the $t$-th sentence,} \\
0 & \text{otherwise.}
\end{cases}
\]

The sentence similarity matrix of the text is a T x T matrix D where for s, t =
1, 2, ..., T we set

\[
D_{s,t} =
\begin{cases}
1 & \text{if } \sum_{l=1}^{L} F_{s,l} F_{t,l} > 0, \\
0 & \text{if } \sum_{l=1}^{L} F_{s,l} F_{t,l} = 0.
\end{cases}
\]

This means that D_{s,t} = 1 if the s-th and t-th sentence have at least one word in common.
Every part of the original text corresponds to a submatrix of D. It is expected that
submatrices which correspond to actual segments will have many sentences with words
in common, and thus will contain many 'ones'. In Figure 1 we give a dotplot of a matrix
corresponding to a 91-sentence text. Ones are plotted as black squares and zeros as
white squares. Further justification for the use of this similarity matrix and graphical
representation can be found in, e.g., [20], [22], [23] and [3].
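As a minimal illustration of the two definitions above, the following Python sketch builds F and D from a toy list of tokenized sentences. The function name `similarity_matrix` and the toy input are assumptions made for this example only; stop-word removal is assumed to have been done beforehand.

```python
def similarity_matrix(sentences):
    """Build the word-occurrence matrix F and the sentence similarity matrix D.

    sentences: list of sentences, each a list of word strings
               (assumed to be already filtered against the stop list).
    """
    vocab = sorted({w for sent in sentences for w in sent})
    col = {w: l for l, w in enumerate(vocab)}
    T, L = len(sentences), len(vocab)

    # F[t][l] = 1 if the l-th vocabulary word occurs in the t-th sentence.
    F = [[0] * L for _ in range(T)]
    for t, sent in enumerate(sentences):
        for w in sent:
            F[t][col[w]] = 1

    # D[s][t] = 1 if sentences s and t share at least one word.
    D = [[1 if any(F[s][l] and F[t][l] for l in range(L)) else 0
          for t in range(T)]
         for s in range(T)]
    return F, D


# Toy input (hypothetical): the first two sentences share the word "cat".
toy_sentences = [["cat", "sat"], ["cat", "ran"], ["dog", "barked"]]
F, D = similarity_matrix(toy_sentences)
```

In this toy case the first two rows of D are identical and the third sentence is similar only to itself, which is the block structure a dotplot makes visible.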

We make the assumption that segment boundaries always occur at the end of sentences.
A segmentation of a text is a partition of the set {1, 2, ..., T} into K subsets (i.e.
segments, where K is a variable number) of the form {1, 2, ..., t_1}, {t_1 + 1, t_1 + 2, ..., t_2},
..., {t_{K-1} + 1, t_{K-1} + 2, ..., T} and can be represented by a vector t = (t_0, t_1, ..., t_K),
where t_0, t_1, ..., t_K are the segment boundaries corresponding to the last sentence of each
subset (so that t_0 = 0 and t_K = T).

2.2 Dynamic Programming 

Dynamic programming guarantees the optimality of the result with respect
to the input and the parameters. Following the approach of Heinonen, we use a
dynamic programming algorithm which decides the locations of the segment boundaries
by calculating the globally optimal splitting t on the basis of a similarity matrix (or
a curve), a preferred fragment length and a cost function. Given a similarity
matrix D and the parameters μ, σ, r and γ (the role of each of which is described
below), the dynamic programming algorithm minimizes a segmentation
cost function J(t; μ, σ, r, γ) with respect to t. Here t is the independent variable, a
vector specifying the boundary position of each segment and hence the number
of segments K, while μ, σ, r and γ are parameters. The cost function is defined in equation (1).

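\[
J(t; \mu, \sigma, r, \gamma) = \sum_{k=1}^{K} \left[ \gamma \, \frac{(t_k - t_{k-1} - \mu)^2}{2\sigma^2}
  - (1 - \gamma) \, \frac{\sum_{s=t_{k-1}+1}^{t_k} \sum_{t=t_{k-1}+1}^{t_k} D_{s,t}}{(t_k - t_{k-1})^{r}} \right]
\qquad (1)
\]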

Hence the sum of the costs of the K segments constitutes the total segmentation
cost; the cost of each segment is the sum of the following two terms (with their relative
importance weighted by the parameter γ):

1. The term $\frac{(t_k - t_{k-1} - \mu)^2}{2\sigma^2}$ corresponds to the length information, measured as the
deviation from the average segment length. In this sense, μ and σ can be considered
as the mean and standard deviation of segment length, measured either on
the basis of words or on the basis of sentences appearing in the text's segments,
and can be estimated from training data.

2. The term $\frac{\sum_{s=t_{k-1}+1}^{t_k} \sum_{t=t_{k-1}+1}^{t_k} D_{s,t}}{(t_k - t_{k-1})^{r}}$ corresponds to (word) similarity between sentences.
The numerator of this term is the total number of ones in the D submatrix
corresponding to the k-th segment. In the case where the parameter r is equal to
2, (t_k - t_{k-1})^r corresponds to the area of the submatrix and the above fraction
corresponds to 'segment density'. A 'generalized density' is obtained when r ≠ 2
and enables us to control the degree of influence of the surface with regard to the
'information' (i.e. the number of ones) included in it. Strong intra-segment similarity
(as measured by the number of words which are common between sentences
belonging to the segment) is indicated by large values of this fraction,
irrespective of the exact value of r.
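The two terms can be made concrete with a small Python sketch that evaluates the cost of a given boundary vector. It assumes the per-segment cost is γ(n - μ)²/(2σ²) - (1 - γ)·ones/n^r, with n the segment length in sentences, which is how equation (1) is read here; the function name `segmentation_cost` and the toy matrix are illustrative assumptions.

```python
def segmentation_cost(D, t, mu, sigma, r, gamma):
    """Evaluate the segmentation cost J for a boundary vector t.

    D: T x T similarity matrix as a list of lists of 0/1.
    t: boundaries [t_0 = 0, t_1, ..., t_K = T]; segment k covers the
       sentences t_{k-1}+1 .. t_k (1-based), i.e. indices t_{k-1} .. t_k - 1.
    """
    total = 0.0
    for lo, hi in zip(t, t[1:]):
        n = hi - lo
        # Length term: deviation from the expected segment length.
        length_term = (n - mu) ** 2 / (2 * sigma ** 2)
        # Similarity term: ones in the segment's submatrix over n^r.
        ones = sum(D[s][u] for s in range(lo, hi) for u in range(lo, hi))
        density = ones / n ** r
        total += gamma * length_term - (1 - gamma) * density
    return total


# Toy matrix (hypothetical): two fully similar blocks of two sentences each.
D_toy = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
cost_split = segmentation_cost(D_toy, [0, 2, 4], mu=2, sigma=1, r=2, gamma=0.5)
cost_whole = segmentation_cost(D_toy, [0, 4], mu=2, sigma=1, r=2, gamma=0.5)
```

With these toy values the two-segment split has the lower cost, because each segment is fully dense and exactly of the expected length.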

Segments with high density and small deviation from the average segment length (i.e. a
small value of the corresponding J(t; μ, σ, r, γ); see footnote 2) provide a 'good'
segmentation vector t. The global minimum of J(t; μ, σ, r, γ) provides the optimal
segmentation t. It is worth mentioning that the optimal t specifies both the optimal
number of segments K and the optimal positions of the segment boundaries t_0, t_1, ..., t_K.
Our algorithm is presented below in pseudocode.

Dynamic Programming Algorithm

Input: The T x T similarity matrix D; the parameters μ, σ, r, γ.

Initialization

    For s = 0, 1, ..., T-1
        For t = s+1, ..., T
            Sum = Σ_{m=s+1}^{t} Σ_{n=s+1}^{t} D_{m,n}
            S_{s,t} = Sum / (t - s)^r
        End
    End

Minimization

    C_0 = 0, Z_0 = 0
    For t = 1, 2, ..., T
        C_t = ∞
        For s = 0, 1, ..., t-1
            If C_s + γ (t - s - μ)^2 / (2σ^2) - (1 - γ) S_{s,t} ≤ C_t
                C_t = C_s + γ (t - s - μ)^2 / (2σ^2) - (1 - γ) S_{s,t}
                Z_t = s
            EndIf
        End
    End

Backtracking

    K = 0, s_0 = T
    While Z_{s_K} > 0
        K = K + 1
        s_K = Z_{s_{K-1}}
    End
    K = K + 1, s_K = 0
    For k = 0, 1, ..., K
        t_k = s_{K-k}
    End

Output: The optimal segmentation vector t = (t_0, t_1, ..., t_K).

2 Small in the algebraic sense; J(t; μ, σ, r, γ) can take both positive and negative values.
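The pseudocode above can be sketched in Python as follows. This is one possible implementation rather than the authors' code. It assumes the per-segment cost γ(n - μ)²/(2σ²) - (1 - γ)·ones/n^r, and it uses a summed-area table (an implementation choice made here) to obtain the number of ones in any diagonal block of D in constant time. The function name `segment` and the toy input are illustrative.

```python
import math


def segment(D, mu, sigma, r, gamma):
    """Return (boundaries, cost), boundaries being [0, t_1, ..., T]."""
    T = len(D)

    # Summed-area table: P[i][j] = number of ones in D[0:i][0:j].
    P = [[0] * (T + 1) for _ in range(T + 1)]
    for i in range(T):
        for j in range(T):
            P[i + 1][j + 1] = D[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]

    def ones(s, t):
        # Ones in the block for sentences s+1 .. t (1-based).
        return P[t][t] - P[s][t] - P[t][s] + P[s][s]

    # C[t]: minimal cost of segmenting sentences 1..t; Z[t]: previous boundary.
    C = [0.0] + [math.inf] * T
    Z = [0] * (T + 1)
    for t in range(1, T + 1):
        for s in range(t):
            n = t - s
            cost = (C[s]
                    + gamma * (n - mu) ** 2 / (2 * sigma ** 2)
                    - (1 - gamma) * ones(s, t) / n ** r)
            if cost <= C[t]:
                C[t], Z[t] = cost, s

    # Backtrack from T to 0 through the stored predecessors.
    bounds = [T]
    while bounds[-1] > 0:
        bounds.append(Z[bounds[-1]])
    return bounds[::-1], C[T]


# Toy input (hypothetical): two dense blocks of two sentences each.
D_blocks = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 1, 1],
            [0, 0, 1, 1]]
best_bounds, best_cost = segment(D_blocks, mu=2, sigma=1, r=2, gamma=0.8)
```

On this toy matrix the sketch recovers the boundary between the two blocks; the number of segments is not given as input but falls out of the backtracking step.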



3 Evaluation 

3.1 Measures of Segmentation Accuracy 

The performance of our algorithm was evaluated by three indices: Precision, Recall
and Beeferman's Pk metric. Precision and Recall measure segmentation accuracy. For
the segmentation task, Precision is defined as 'the number of the estimated segment 
boundaries which are actual segment boundaries' divided by 'the number of the esti- 
mated segment boundaries'. On the other hand, Recall is defined as 'the number of 
the estimated segment boundaries which are actual segment boundaries' divided by 
'the number of the true segment boundaries'. High segmentation accuracy is indicated 
by high values of both Precision and Recall. However, these two indices have some 
shortcomings. First, high Precision can be obtained at the expense of low Recall and 
conversely. Additionally, those two indices penalize equally every inaccurately estimated 
segment boundary whether it is near or far from a true segment boundary. 
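The two definitions translate directly into code. The sketch below treats boundaries as sets of sentence positions; whether the trivial final boundary T is counted is not stated in the text, so the example passes interior boundaries only, which is an assumption. The function name is illustrative.

```python
def precision_recall(estimated, true):
    """Boundary-level Precision and Recall.

    estimated, true: iterables of boundary positions (sentence indices).
    """
    est, ref = set(estimated), set(true)
    hits = len(est & ref)  # estimated boundaries that are actual boundaries
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example: two of three estimated boundaries are correct,
# and two of the four true boundaries are found.
precision, recall = precision_recall([3, 7, 12], [3, 8, 12, 15])
```

Note that the estimate at position 7 counts as a plain miss even though the true boundary is at 8, which is exactly the second shortcoming described above.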

An alternative measure, Pk, which overcomes the shortcomings of Precision and Recall
and measures segmentation inaccuracy, was introduced recently by Beeferman et al.
Intuitively, Pk measures the proportion of 'sentences which are wrongly predicted
to belong to the same segment (while actually they belong to different segments)' or
'sentences which are wrongly predicted to belong to different segments (while actually
they belong to the same segment)'. Pk is a measure of how well the true and hypothetical
segmentations agree (with a low value of Pk indicating high accuracy [1]). Pk penalizes
near-boundary errors less than far-boundary errors. Hence Pk gives a more informative
picture of segmentation accuracy than Precision and Recall.
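A common way of computing Pk, sketched below, slides a window of width k over the text and counts the positions where the reference and the hypothesis disagree on whether sentences i and i + k lie in the same segment. Setting k to half the mean reference segment length (rounded down) is a widespread convention assumed here; the text itself does not state which k was used. The function name `pk` is illustrative.

```python
def pk(reference, hypothesis, k=None):
    """Window-based segmentation error; 0.0 means perfect agreement.

    reference, hypothesis: boundary vectors of the form [0, t_1, ..., T].
    """
    def labels(bounds):
        # Segment index of every sentence, e.g. [0, 2, 3] -> [0, 0, 1].
        out = []
        for seg_id, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
            out.extend([seg_id] * (hi - lo))
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("both segmentations must cover the same sentences")
    if k is None:
        # Assumed convention: half the mean reference segment length.
        k = max(1, len(ref) // (2 * (len(reference) - 1)))
    windows = len(ref) - k
    if windows <= 0:
        raise ValueError("text too short for the chosen window size")
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(windows))
    return errors / windows


# Hypothetical examples on a 10-sentence text with one true boundary at 5.
pk_perfect = pk([0, 5, 10], [0, 5, 10])
pk_missed = pk([0, 5, 10], [0, 10])
```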

3.2 Experiments 

For the experiments, we use a text collection compiled from a corpus comprising texts
downloaded from the website http://tovima.dolnet.gr of the newspaper
'To Vima'. This newspaper contains articles belonging to the following categories: 1)
Editorial, diaries, reportage, politics, international affairs, sport reviews 2) Cultural
supplement 3) Review magazine 4) Business, finance 5) Personal finance 6) Issue of
the week 7) Book review supplement 8) Art review supplement 9) Travel supplement.
Stamatatos et al. [25] constructed a corpus by collecting texts from supplement 2), which
includes essays on science, culture, history etc. They selected 10 authors from the above
set without taking any special criteria into account. Then 30 texts of each author were
downloaded from the website of the newspaper, as shown in the table below:



Author        Thematic Area
Alachiotis    Biology
Babiniotis    Linguistics
Dertilis      History, Society
Kiosse        Archeology
Liakos        History, Society
Maronitis     Culture, Society
Ploritis      Culture, History
Tassios       Technology, Society
Tsukalas      International Affairs
Vokos         Philosophy

Table 1: List of authors and the thematic areas each of them deals with.

No manual text preprocessing or text sampling was performed aside from removing
unnecessary headings irrelevant to the text itself. All the downloaded texts were taken
from issues published from 1997 until early 1999 in order to minimize the potential
change of the personal style of an author over time. Further details can be found in
[25].

The above texts were preprocessed using the morphosyntactic tagger
(better known as part-of-speech tagger) developed by Giorgos Orphanos. The
aforementioned tagger is a POS tagger for Modern Greek (a highly inflectional language)
which is based on a Lexicon capable of assigning full morphosyntactic attributes (i.e.
POS, Number, Gender, Tense, Voice, Mood and Lemma) to 876,000 Greek word forms.
This Lexicon was used to build a tagged corpus capable of exhibiting the behavior of
all POS ambiguity schemes present in Modern Greek (e.g. Pronoun-Clitic-Article,
Pronoun-Clitic, Adjective-Adverb, Verb-Noun, etc.) as well as the characteristics of
unknown words. This corpus was used for the induction of decision trees, which along
with the Lexicon are integrated into a robust POS tagger for Modern Greek texts.

The tagger architecture consists of three parts: the Tokenizer, the Lexicon and finally 
the Disambiguator and Guesser. Raw text passes through the Tokenizer, where it is 
converted to a stream of tokens. Non-word tokens (e.g. punctuation marks, numbers, 
dates etc.) are resolved by the Tokenizer and receive a tag corresponding to their 
category. Word tokens are looked up in the Lexicon and those found receive one or 
more tags. Words with more than one tag and those not found in the Lexicon pass
through the Disambiguator/Guesser, where the contextually appropriate tag is decided. 

The Disambiguator/Guesser is a 'forest' of decision trees, one tree for each ambiguity
scheme present in Modern Greek and one tree for unknown-word guessing. When a
word with two or more tags appears, its ambiguity scheme is identified. Then the
corresponding decision tree is selected and traversed according to the values of
the morphosyntactic features extracted from contextual tags. This traversal returns 
the contextually appropriate POS along with its corresponding lemma. The ambiguity 
is resolved by eliminating the tag(s) with different POS than the one returned by the 
decision tree. The POS of an unknown word is guessed by traversing the decision tree 
for unknown words, which examines contextual features along with the word ending 
and capitalization and returns an open class POS and the corresponding lemma. 

3.2.1 Preprocessing 

For the experiments we use the texts of the collection compiled from the corpus
of the newspaper 'To Vima'. Each of the 300 texts of the collection
is preprocessed using the POS tagger created by G. Orphanos.
More specifically, every word in the text was substituted by its lemma, as determined by
the tagger. Punctuation marks, numbers and all words other than nouns, verbs,
adjectives and adverbs were removed. Words whose lemma was
not determined by the tagger, because they were not contained in the
Lexicon used for the creation of the tagger, were not substituted and were
used as they were. The only other information that was kept was the end of each sentence
in each text. We next present two suites of experiments. The difference
between the two suites lies in the length of the segments created and the number of authors
used for the creation of the texts to segment, each text being a concatenation of
ten text segments.
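The filtering step can be sketched as below. The tagger's actual output format is not described in the text, so the `(word, lemma, pos)` triples, the tag names in `CONTENT_POS` and the function name `preprocess` are all hypothetical stand-ins; a lemma of `None` marks a word whose lemma the tagger could not determine.

```python
# Hypothetical tag names for the four word classes that are kept.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def preprocess(tagged_sentences):
    """Keep content words only, replacing each by its lemma when known.

    tagged_sentences: list of sentences, each a list of
                      (word, lemma, pos) triples (assumed format).
    Sentence ends are preserved through the nesting of the output.
    """
    result = []
    for sentence in tagged_sentences:
        kept = []
        for word, lemma, pos in sentence:
            if pos not in CONTENT_POS:
                continue  # drops punctuation, numbers, function words
            # Words without a lemma are used as they appear in the text.
            kept.append(lemma if lemma is not None else word)
        result.append(kept)
    return result


# Toy English input standing in for tagger output.
tagged = [[("cats", "cat", "NOUN"), ("and", "and", "CONJ"),
           ("ran", "run", "VERB"), ("qubit", None, "NOUN"),
           (",", ",", "PUNCT")]]
cleaned = preprocess(tagged)
```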

3.2.2 1st suite of experiments 

In the first suite of experiments, our collection consists of 6 datasets: Set0, ..., Set5. The
difference between those datasets lies in the number of authors used for the generation
of the texts to segment and consequently the number of texts used from the collection.
The table below contains this information.



Set   Authors                                           No. of texts per set
0     Kiosse, Alachiotis                                60
1     Kiosse, Maronitis                                 60
2     Kiosse, Alachiotis, Maronitis                     90
3     Kiosse, Alachiotis, Maronitis, Ploritis           120
4     Kiosse, Alachiotis, Maronitis, Ploritis, Vokos    150
5     All Authors                                       300

Table 2: List of the sets compiled in the 1st suite of experiments and the authors whose
texts were used for each of them.

For each of the above sets, we constructed four subsets. The difference between
those subsets lies in the range of the number of sentences appearing in each segment of
every text. If a and b denote the minimum and maximum number of sentences in a
segment, we have used four different pairs (a, b): (3,11), (3,5), (6,8) and (9,11). In every
dataset, before generating any of the texts to segment, each of which contains 10
segments, we selected the authors whose texts would be used for this generation. If X is
the number of authors contributing to the generation of the dataset, then, for all datasets,
each text is generated according to the following procedure (which guarantees that
each text contains ten segments):

1. For j = 1, 2, ..., 10, where j corresponds to the j-th out of the 10 segments of the
generated text, steps 2-4 are carried out.

2. A random integer i in {1, ..., X}, corresponding to an author, is generated.

3. A random integer k in {1, 2, ..., 30}, corresponding to one of the texts of the
selected author i, is generated.

4. A random integer in {a, ..., b}, corresponding to the number of consecutive sentences
extracted from text k (starting at the first sentence of the text), is generated.
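The generation procedure can be sketched as follows. The corpus layout (a dictionary from author name to a list of texts, each text a list of sentences), the function name `generate_text` and the toy data are assumptions for illustration; the sketch also assumes every text has at least b sentences, which the description does not address.

```python
import random


def generate_text(corpus, authors, a, b, n_segments=10, rng=None):
    """Concatenate n_segments randomly drawn text openings.

    corpus: dict mapping author -> list of texts (each a list of sentences).
    Returns the sentences and the true boundaries [0, t_1, ..., T].
    """
    rng = rng or random.Random()
    sentences, boundaries = [], [0]
    for _ in range(n_segments):
        author = rng.choice(authors)       # step 2: pick an author
        text = rng.choice(corpus[author])  # step 3: pick one of the texts
        n = rng.randint(a, b)              # step 4: segment length in [a, b]
        sentences.extend(text[:n])         # first n sentences of the text
        boundaries.append(len(sentences))
    return sentences, boundaries


# Toy corpus (hypothetical): 2 authors, 3 texts each, 20 sentences per text.
toy_corpus = {
    name: [[f"{name}-text{j}-sentence{i}" for i in range(20)] for j in range(3)]
    for name in ("A", "B")
}
gen_sentences, gen_bounds = generate_text(
    toy_corpus, ["A", "B"], a=3, b=5, rng=random.Random(0))
```

Keeping the true boundaries alongside the generated sentences is what later allows Precision, Recall and Pk to be computed.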

For every subset, using the procedure described above, we generated 50 texts. As
mentioned before, our algorithm uses four parameters μ, σ, γ and r, where μ and
σ can be interpreted as the average and standard deviation of segment length; it is
not immediately obvious how to calculate the optimal values of each of the parameters.
We determine appropriate values of μ, σ, γ and r using
training data and a parameter validation procedure. The algorithm is then evaluated
on (previously unseen) test data. More specifically, for each of the datasets Set0, ..., Set5
and each of their subsets we perform the following procedure:

1. Half of the texts in the dataset are chosen randomly to be used as training texts;
the rest of the samples are set aside to be used as test texts.

2. Appropriate μ and σ values are determined using all the training texts and the
standard statistical estimators.

3. Parameter γ is set to take the values 0.00, 0.01, 0.02, ..., 0.09, 0.1, 0.2, 0.3,
..., 1.0 and r to take the values 0.33, 0.5, 0.66, 1. This yields 20 x 4 = 80 possible
combinations of γ and r values. Appropriate γ and r values are determined by
running the segmentation algorithm on all the training texts with the 80 possible
combinations of γ and r values; the one that yields the minimum Pk value is
considered to be the optimal (γ, r) combination.

4. The algorithm is applied to the test texts using the previously estimated γ, r, μ and
σ values.
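Steps 2 and 3 amount to a small grid search, which might be sketched as below. The callables `segmenter` and `error` are placeholders for a segmentation routine and a Pk implementation; their names and signatures, and the use of the sample standard deviation as the 'standard' estimator, are assumptions of this sketch.

```python
from itertools import product
from statistics import mean, stdev

# The 20 gamma values and 4 r values listed in step 3 (80 combinations).
GAMMAS = [i / 100 for i in range(10)] + [i / 10 for i in range(1, 11)]
RS = [0.33, 0.5, 0.66, 1.0]


def validate(train, segmenter, error):
    """Estimate mu, sigma and pick the (gamma, r) pair with the lowest error.

    train: list of (D, true_boundaries) pairs.
    segmenter(D, mu, sigma, r, gamma) -> estimated boundaries (placeholder).
    error(true_boundaries, estimated_boundaries) -> float, e.g. Pk.
    """
    lengths = [hi - lo
               for _, bounds in train
               for lo, hi in zip(bounds, bounds[1:])]
    mu, sigma = mean(lengths), stdev(lengths)

    def mean_error(pair):
        gamma, r = pair
        return mean(error(bounds, segmenter(D, mu, sigma, r, gamma))
                    for D, bounds in train)

    gamma, r = min(product(GAMMAS, RS), key=mean_error)
    return mu, sigma, gamma, r
```

The returned four values are then held fixed when the algorithm is applied to the test texts, as in step 4.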

The influence of γ and r on Pk in the first suite of experiments can be
observed in Figures 2-5 (corresponding to subsets '3-11', '3-5', '6-8' and '9-11' of Set5).
In those figures Exp 1 refers to the first suite of experiments.

The above procedure is repeated five times for each of the six datasets and the 
resulting values of Precision, Recall and Pk are averaged. The performance of our 
algorithm (as obtained by the validated parameter values) is presented in Table 3. 

3.2.3 2nd Suite of experiments 

In the second suite of experiments we used the same collection of texts compiled from 
the corpus of the newspaper 'To Vima'. The difference between those two suites lies 
in the way of generating the texts used for training and for testing. In this suite of 
experiments, we used all the available (300) texts of the collection of the Greek corpus, 
which means all the available authors. We constructed a single dataset containing 200 
texts. Half of them were used for training while the rest of them were used for testing. 
Each of the aforementioned texts was generated according to the following procedure 
(which guarantees that each text contains 10 segments): 

1. For j = 1, 2, ..., 10, where j corresponds to the j-th out of the 10 segments of the
generated text, steps 2-5 are carried out.

2. A random integer i in {1, ..., X}, corresponding to an author, is generated.

3. A random integer k in {1, 2, ..., 30}, corresponding to one of the texts of the
selected author i, is generated. The selected text is read and scanned in order
to determine the number of paragraphs it consists of. If Z is this number of
paragraphs, then:

4. A random integer l in {1, 2, ..., Z}, corresponding to the number of paragraphs
taken from text k, is generated.

5. A random integer m in {1, ..., Z-l}, corresponding to the "starting paragraph",
is generated. Thus the segment contains all the paragraphs of text k starting from
paragraph m and ending at paragraph m+l.

1st suite of Experiments


                  (3,11)    (3,5)     (6,8)     (9,11)
Set0 Precision    70.65%    86.82%    96.44%    93.33%
Set0 Recall       71.11%    87.11%    96.44%    93.33%
Set0 Beeferman    14.04%    6.20%     0.82%     0.84%
Set1 Precision    63.86%    82.98%    91.11%    94.67%
Set1 Recall       67.11%    83.56%    91.11%    94.67%
Set1 Beeferman    15.82%    8.47%     2.81%     0.98%
Set2 Precision    71.14%    90%       91.11%    92.44%
Set2 Recall       60.89%    89.78%    91.11%    92.44%
Set2 Beeferman    14.42%    3.45%     2.15%     1.247%
Set3 Precision    59.99%    84.44%    86.22%    91.11%
Set3 Recall       58.67%    83.56%    86.22%    91.11%
Set3 Beeferman    17.93%    7.36%     3.28%     1.45%
Set4 Precision    57.99%    85%       88.89%    91.11%
Set4 Recall       51.11%    84.89%    88.89%    91.11%
Set4 Beeferman    17.38%    6.76%     2.65%     1.39%
Set5 Precision    65.74%    81.56%    89.33%    88.89%
Set5 Recall       61.78%    81.78%    89.33%    88.89%
Set5 Beeferman    14.54%    6.49%     3.57%     1.86%

Table 3: Exp. Suite 1: The Precision, Recall and Beeferman's metric values for the datasets
Set0, Set1, Set2, Set3, Set4 and Set5, using sentences as the unit of segmentation, obtained
by a validation procedure. Columns correspond to the (a, b) sentence ranges of the subsets.

Given this method of generating texts, the 200
generated texts are longer than those generated in
the first suite of experiments. Thus the segmentation of such texts constitutes a more
difficult problem. We used the same validation procedure as before with the same
values for the parameters r and γ. The validated results are listed in the table
below:



2nd suite of Experiments

Precision    60.60%
Recall       57.00%
Beeferman    11.07%

Table 4: Exp. Suite 2: The Precision, Recall and Beeferman's metric values for the single
dataset, using paragraphs as the unit of segmentation, obtained by a validation procedure.

4 Discussion 

Our algorithm was previously tested on Choi's data collection [3], which contains
English texts, achieving significantly better results than the ones previously reported
(e.g. in [3] and [23]). Since the collection used here has not been previously used in the
literature for the purpose of text segmentation, we cannot provide a direct comparative
assessment. However, the performance obtained is comparable to, and in most cases better
than, the corresponding performance on Choi's collection, even though in several cases the
problem dealt with by our algorithm is more difficult. The difficulty lies in the fact that the
thematic areas dealt with by several authors are very similar (see Table 1). One of the reasons
for the high segmentation accuracy is the robustness of the POS tagger used. We have
observed that, in general, the tagger fails to find the tag and lemma of very technical
words. Using such words as they appear in the original text does not have a negative
impact on the segmentation accuracy. The robustness of our algorithm is also indicated
by the performance obtained in the second suite of experiments, where the segment
length is larger and the deviation from the average length is high. Even in that case
our algorithm achieved a low Pk value (11.07%, see Table 4). This is the result of the
combination of the following facts. First, the use of the segment length term in the cost
function seems to improve segmentation accuracy significantly. Second, the use of
'generalized density' (r ≠ 2) appears to significantly improve performance. Even though
the use of 'true density' (r = 2) appears more natural, the best segmentation performance
(minimum value of Pk) is achieved for significantly smaller values of r. This performance
in most cases is improved when using appropriate values of μ, σ, γ and r derived from
training data and parameter validation.

Finally, it is worth mentioning that our approach is 'global' in two respects. First, 
sentence similarity is computed globally through the use of the D matrix and dot-plot. 
Second, this global similarity information is also optimized globally by the use of the 
dynamic programming algorithm. This is in contrast with the local optimization of
global information (used by Choi) and global optimization of local information (used
by Heinonen).

It is worth mentioning that the computational complexity of our algorithm is comparable
to that of the other methods (namely O(T^2), where T is the number of sentences).
Finally, our algorithm has the advantage of automatically determining the
optimal number of segments.
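One way to see that a quadratic bound is attainable for the minimization step is sketched below. It assumes that the block sums of D are read from a precomputed summed-area table P, an implementation device that the text does not mention and that is introduced here purely for illustration.

```latex
% Assumed auxiliary table P (not part of the original description):
P_{i,j} = \sum_{m=1}^{i} \sum_{n=1}^{j} D_{m,n}
        = D_{i,j} + P_{i-1,j} + P_{i,j-1} - P_{i-1,j-1},
\qquad P_{0,j} = P_{i,0} = 0 .

% Number of ones in the block for sentences s+1, ..., t, obtained in O(1):
\sum_{m=s+1}^{t} \sum_{n=s+1}^{t} D_{m,n}
  = P_{t,t} - P_{s,t} - P_{t,s} + P_{s,s} .

% There are T(T+1)/2 pairs (s,t) with 0 \le s < t \le T, each needing constant work:
\frac{T(T+1)}{2} \cdot O(1) = O(T^2) .
```

Filling the table P itself takes T^2 additions, so under this assumption both the initialization and the minimization stay within O(T^2).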

5 Conclusion 

We have presented a dynamic programming algorithm which performs text segmentation
by global minimization of a segmentation cost consisting of two terms: within-segment
word similarity and prior information about segment length. The algorithm performs
satisfactorily, achieving high segmentation accuracy on a
collection of Greek texts. In the future we intend to use other measures of
sentence similarity. We also plan to apply our algorithm to a wide spectrum of text
segmentation tasks. We are interested in the segmentation of non-artificial, real-life texts,
texts having a diverse distribution of segment length, long texts, and change-of-topic
detection in newsfeeds.

Figure 1: The similarity matrix D corresponding to a text from the dataset '9-11'
of Set 5. This text contains 91 sentences, hence D is a 91 x 91 matrix. A black dot 
at position (m,n) indicates that the m-th and n-th sentence have at least one word in 
common.

Figure 3: Pk for '3-5' of Set5.

Figure 4: Pk for '6-8' of Set5.

Figure 5: Pk for '9-11' of Set5.

(Plots not reproduced; the legend of each plot lists the series Exp1 r = 1, Exp1 r = 0.66,
Exp1 r = 0.5 and Exp1 r = 0.33.)



